Abstract

Graph pattern matching is often defined in terms of sub- 
graph isomorphism, an NP-complete problem. To lower its 
complexity, various extensions of graph simulation have been 
considered instead. These extensions allow pattern matching
to be conducted in cubic time. However, they fall short
of capturing the topology of data graphs, i.e., graphs may 
have a structure drastically different from pattern graphs 
they match, and the matches found are often too large to un- 
derstand and analyze. To rectify these problems, this paper 
proposes a notion of strong simulation, a revision of graph 
simulation, for graph pattern matching. (1) We identify a 
set of criteria for preserving the topology of graphs matched. 
We show that strong simulation preserves the topology of 
data graphs and finds a bounded number of matches. (2) 
We show that strong simulation retains the same complexity 
as earlier extensions of simulation, by providing a cubic-time 
algorithm for computing strong simulation. (3) We present 
the locality property of strong simulation, which allows us to 
effectively conduct pattern matching on distributed graphs. 
(4) We experimentally verify the effectiveness and efficiency 
of these algorithms, using real-life data and synthetic data. 

1. Introduction 

Graph pattern matching is being increasingly used in a 
number of applications, e.g., software plagiarism detection, 
biology, social networks and intelligence analysis [27, 32, 33, 
35]. Given a pattern graph Q and a data graph G, the task is
to find all subgraphs of G that match Q. Here matching
is typically defined in terms of subgraph isomorphism (see,
e.g., [4, 20] for surveys): a subgraph G_s of G matches Q if
there exists a bijective function f from the nodes of Q to the
nodes in G_s such that (1) for each pattern node u in Q, u
and f(u) have the same label, and (2) there exists an edge
(u, u') in Q if and only if (f(u), f(u')) is an edge in G_s.

However, subgraph isomorphism is an NP-complete problem
[34]. Moreover, there are possibly exponentially many
subgraphs in G that match Q. In addition, as observed in [6,
19], it is often too restrictive to catch sensible matches, as
it requires matches to have exactly the same topology as a
pattern graph. These issues hinder its applicability in emerging
applications such as social networks and crime detection.
Figure 1: Social matching: query and data graphs

To reduce the complexity, graph simulation [29] has been 
adopted for pattern matching. A graph G matches a pattern
Q via graph simulation if there exists a binary relation S ⊆
V_q × V, where V_q and V are the sets of nodes in Q and G,
respectively, such that (1) for each (u, v) ∈ S, u and v have
the same label; and (2) for each node u in Q, there exists v
in G such that (a) (u, v) ∈ S, and (b) for each edge (u, u')
in Q, there exists an edge (v, v') in G such that (u', v') ∈ S.
Graph simulation can be determined in quadratic time [24].
Recently this notion has been extended by mapping edges 
in Q to (bounded) paths in G [19, 18], with a cubic-time 
complexity, to identify matches in, e.g., social networks. 

Nevertheless, the low complexity comes with a price: (1) 
simulation and its extensions [19, 18] do not preserve the 
topology of data graphs; in other words, they may match 
a graph G and a pattern Q with drastically different struc- 
tures. (2) The match relation S is often too large to under- 
stand and analyze, as illustrated below. 

Example 1: Consider a real-life example taken from [31].
A headhunter wants to find a biologist (Bio) to help a group
of software engineers (SEs) analyze genetic data. To do
this, she uses an expertise recommendation network G1, as
depicted in Fig. 1. In G1, a node denotes a person labeled
with expertise, and an edge indicates recommendation, e.g., HR1
recommends Bio1, and there is an edge from each DM'_i to
Bio3. The biologist Bio needed is specified with a pattern
graph Q1, also shown in Fig. 1. Intuitively, the Bio has to
be recommended by: (a) an HR person; (b) an SE, i.e., the
Bio has experience working with SEs; and (c) a data mining
specialist (DM), as data mining techniques are required for
the job. Moreover, (d) the SE is also recommended by an HR
person, and (e) there is an artificial intelligence expert (AI)
who recommends the DM and is recommended by a DM.

When subgraph isomorphism is used, no match can be
found, i.e., no subgraph of G1 is isomorphic to Q1. In other
words, subgraph isomorphism imposes too strict a constraint
on the topology of the graphs matched.

When graph simulation or its extensions [19, 18] are
adopted, all four biologists in G1 are matches for Bio in
Q1. However, Bio1 and Bio2 are recommended by either HR
only or by SE only in G1, and Bio3 is recommended by neither
an HR nor an SE. Hence these are not the ones that
the headhunter really wants. Only Bio4 satisfies all these
conditions and makes a good candidate.

This tells us that simulation and its extensions [19, 18] 
do not preserve the structural properties in graph pattern 
matching and therefore, may return excessive "matches" 
that one does not want. Indeed, observe the following. 
Topological structure. (a) While Q1 is a connected graph,
G1 is disconnected, but G1 matches Q1 via simulation. (b)
Node Bio in Q1 has three "parents", but it matches nodes
Bio1 and Bio2 in G1 that have a single "parent" each. (c) The
directed cycle with only two nodes AI and DM in Q1 matches
the long cycle consisting of AI'_1, DM'_1, ..., AI'_k, DM'_k, AI'_1 in
G1, and the undirected cycle with nodes HR, SE and Bio in
Q1 matches the tree rooted at HR1 in G1.
Match results. The match relation of simulation, when represented
as a result graph as suggested in [19], is the entire
graph G1. In general, the result graphs are often large when
matching is performed on real-life networks, e.g., LinkedIn
[1], which has 19.5M users and yields a graph of 100GB in
size. These make it hard to analyze the match results. □

These suggest that we revise the notion of graph simula- 
tion to strike a balance between its computational complex- 
ity and its ability to capture the topology of graphs. Indeed, 
graph simulation was proposed for process algebra to mimic 
steps of a process [29] . To make practical use of it in graph 
pattern matching, we need to enhance it by incorporating 
more topological structure of graphs. 

Contributions & Roadmap. We introduce a revision of 
graph simulation that preserves the topology of graphs and 
has the same complexity as extensions [19, 18] of simulation. 

(1) We propose a revision of graph simulation [29] (Sec- 
tion 2), referred to as strong simulation, by enforcing two 
conditions: (a) the duality to preserve the parent relation- 
ships and (b) the locality to eliminate excessive matches. 
For example, matching Q1 on G1 of Fig. 1 via strong simulation
finds Bio4 as the only match for Bio in Q1.

(2) We identify a set of criteria for topology preservation, 
and show that strong simulation preserves the topology of 
pattern graphs and data graphs (Section 3). We also prove 
that the number of matches via strong simulation is linear in 
the size of the data graph rather than exponential for sub- 
graph isomorphism, and each match has a diameter bounded 
by the diameter of the pattern graph. Hence strong simula- 
tion indeed rectifies the problems of subgraph isomorphism 
and simulation. Moreover, we show that slight extensions 
to the notion make graph pattern matching intractable. 

(3) We show that strong simulation retains the same com- 
plexity as earlier extensions [19, 18] by providing a cubic- 
time computation algorithm (Section 4). We also develop 
effective optimization techniques, notably a quadratic-time 
algorithm to minimize strong simulation queries. 

In addition, we show that the locality of strong simula- 
tion allows us to develop a simple yet effective algorithm 
to find matches in distributed graphs. To the best of our 
knowledge, this is among the first distributed algorithms for 
graph pattern matching, for which the need is evident when 
processing massive graphs (see e.g., [15, 21, 28]). 

(4) Using both real-life data (Amazon and YouTube) and 
synthetic data, we conduct an extensive experimental study 
(Section 5). We find that our algorithms for strong simulation
scale well with large data graphs (e.g., with 1.5 × 10^8
nodes). They are able to identify sensible matches that subgraph
isomorphism fails to catch, and to eliminate excessive
matches of graph simulation that do not make sense. We
find that 70%-80% of the matches found by strong simulation
are also found by subgraph isomorphism, compared with only
25%-38% for graph simulation. We also find that our optimization
techniques are effective, reducing the running time by 1/3 on average.

We contend that strong simulation provides a promising 
model for graph pattern matching in emerging applications.
Indeed, (1) in contrast to subgraph isomorphism, strong
simulation is in cubic time rather than NP-complete, and
moreover, due to its locality, it yields a set of matches with 
cardinality linear in the size of the data graph rather than 
exponential, where each match is bounded by the diameter 
of the pattern graph. (2) As opposed to graph simulation, it 
captures the topology of patterns in its matches, such as par- 
ents, connectivity and cycles, by enforcing the duality and 
locality on matches, while it retains the same complexity 
as simulation. (3) Unlike simulation, the locality of strong 
simulation makes it possible to efficiently conduct pattern 
matching on distributed graphs. (4) As will be seen in Sec- 
tion 3, minor extensions to strong simulation would make 
graph pattern matching an intractable problem. In other 
words, strong simulation strikes a balance between the com- 
plexity and the capability to capture graph topology. 
Related work. There has been a host of work on pattern 
matching via subgraph isomorphism (e.g., [32, 33, 35]; see [4,
20] for surveys). In light of its intractability, approximate 
matching has been studied to find inexact solutions, which 
allows node/edge mismatches [4, 32]. This work differs from 
approximate matching in that no node/edge mismatches are 
allowed, and that the number of matches via strong simu- 
lation is linear in the size of the data graph rather than 
exponential for (approximate) subgraph isomorphism. 

Closer to this work are bounded simulation [19] and graph 
pattern queries of [18]. The former extends simulation [29] 
by allowing bounds on the number of hops in pattern graphs, 
and the latter further extends [19] by incorporating regular 
expressions as edge constraints on pattern graphs. Pattern 
matching via these extensions is in cubic time [18, 19]. As
remarked earlier, these notions of simulation may fail to capture
the topology of graphs, and yield false matches or too
large a match relation. These are precisely the problems
that strong simulation aims to rectify, by imposing additional
constraints (duality and locality) on graph simulation.

Schema extraction is to discover the implicit structure of 
semi-structured data, which has no schema predefined. It 
has proved effective in query formulation and optimization 
[2, 22]. Schema of semi-structured data is often extracted via
a mild generalization of simulation that deals with labeled
edges [2]. Nevertheless, topology preservation is not an issue 
in schema extraction, and no previous work there has stud- 
ied how simulation should be refined to capture topology. 

Query minimization, as a classical optimization technique, 
has been well studied for SQL [3], XPath (e.g., [10]), graph
simulation [8] and graph pattern queries [18]. This work 
explores it for pattern matching via strong simulation. 

Distributed query processing has been studied for rela- 
tional data [26] and XML [9, 11]. There has also been recent 
work on distributed graph processing to manage large-scale 
graphs [15, 21, 28]. However, to the best of our knowl- 
edge, no previous work has studied distributed computation 
of graph simulation [29] and its extensions [19, 18], not to
mention strong simulation proposed in this work. 

2. Strong Simulation 

In this section, we first present basic notations of graphs. 
We then introduce the notion of strong simulation. 

2.1 Preliminaries 

We specify both data graphs and pattern graphs as follows.
Let Σ be a (possibly infinite) set of labels.

Graphs. A node-labeled directed graph (or simply a graph) is
defined as G(V, E, l), where (1) V is a finite set of nodes; (2)
E ⊆ V × V is a finite set of edges, in which (u, u') denotes
an edge from node u to u'; and (3) l is a labeling function
that maps each node u in V to a label l(u) in Σ. We denote
G as (V, E) when it is clear from the context.

Intuitively, the function l() specifies node attributes, e.g.,
keywords, blogs, comments, ratings, names, emails, companies
[5]; and the label set Σ denotes all such attributes.

We next review some basic notations of graphs. 

Subgraphs. Graph H(V_s, E_s, l_H) is a subgraph of graph
G(V, E, l_G), denoted as G[V_s, E_s], if (1) for each node u ∈ V_s,
u ∈ V and l_H(u) = l_G(u), and (2) for each edge e ∈ E_s,
e ∈ E. That is, subgraph G[V_s, E_s] only contains a subset
of nodes and a subset of edges of G.

Paths. A directed (resp. undirected) path p is a sequence of
nodes v_1, ..., v_n such that (v_i, v_{i+1}) (resp. either (v_i, v_{i+1})
or (v_{i+1}, v_i)) is an edge in G for i ∈ [1, n − 1]. The length of
p is the number of edges in p. Abusing notations for trees,
we refer to v_{i+1} as a child of v_i (or v_i as a parent of v_{i+1}).

A directed (resp. undirected) cycle in a graph is a directed
(resp. undirected) path with v_1 = v_n, having no repeated
nodes other than the start and end nodes v_1 and v_n.

A graph is connected if for each pair of nodes, there exists
an undirected path between them in the graph.

Connected components. A connected component of a graph 
is a subgraph in which any two nodes are connected to each 
other by undirected paths, and which is connected to no ad- 
ditional nodes. A graph that is itself connected has exactly 
one connected component, consisting of the entire graph. 

Distance and diameter. Given two nodes u, v in a graph G,
the distance from u to v, denoted by dist(u, v), is the length
of the shortest undirected path from u to v in G.

The diameter of a connected graph G, denoted by d_G, is
the longest shortest distance of all pairs of nodes in G, i.e.,
d_G = max(dist(u, v)) for all nodes u, v in G.
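
As an illustration of these two notions, the following is a minimal Python sketch that computes undirected distances by breadth-first search and takes the diameter as the largest of them. The function names and the edge-list representation are assumptions made for this example, and the pattern edges are read off the description of Q1 in Example 1.

```python
from collections import deque

def undirected_adjacency(nodes, edges):
    """Adjacency map that ignores edge direction."""
    adj = {v: set() for v in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj

def dist_from(adj, source):
    """Shortest undirected distances from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        x = queue.popleft()
        for y in adj[x]:
            if y not in dist:
                dist[y] = dist[x] + 1
                queue.append(y)
    return dist

def diameter(nodes, edges):
    """Longest shortest distance; assumes a non-empty connected graph."""
    adj = undirected_adjacency(nodes, edges)
    return max(max(dist_from(adj, v).values()) for v in nodes)

# Pattern Q1 as described in Example 1 (edge = "recommends").
q1_nodes = ["HR", "SE", "Bio", "DM", "AI"]
q1_edges = [("HR", "SE"), ("HR", "Bio"), ("SE", "Bio"),
            ("DM", "Bio"), ("AI", "DM"), ("DM", "AI")]
print(diameter(q1_nodes, q1_edges))
```

Running one search per node costs O(|V|(|V| + |E|)) time, which is adequate for small pattern graphs.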

We assume w.l.o.g. that pattern graphs are connected. 

2.2 The Definition of Strong Simulation 

We define strong simulation by enforcing two conditions 
on simulation [29]: duality and locality. As will be seen in 
Sections 3 and 4, these conditions capture the topology of 
graphs and eliminate excessive matches to a maximum ex- 
tent, while retaining a low ptime computational complexity. 

Consider a pattern graph Q(V_q, E_q) and a data graph
G(V, E). To define strong simulation, we first review the
notion of graph simulation [29]. Graph G matches pattern
Q via graph simulation, denoted by Q ≺ G, if there exists
a binary match relation S ⊆ V_q × V such that (1) for each
(u, v) ∈ S, u and v have the same label, i.e., l_Q(u) = l_G(v);
and (2) for each node u in V_q, there exists v in V such that
(a) (u, v) ∈ S, and (b) for each edge (u, u') ∈ E_q, there
exists an edge (v, v') in E such that (u', v') ∈ S.
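
To make the definition concrete, here is a small Python sketch that computes the maximum simulation relation by naive fixpoint refinement. It is not the quadratic-time algorithm of [24]; the function name, the dictionary-based labels and the toy graph are assumptions made for illustration only.

```python
def max_simulation(q_nodes, q_edges, q_label, g_nodes, g_edges, g_label):
    """Return {pattern node: set of matching data nodes}, or None if Q does not match G."""
    succ = {v: set() for v in g_nodes}
    for v, v2 in g_edges:
        succ[v].add(v2)
    # Start from all label-compatible candidates.
    sim = {u: {v for v in g_nodes if g_label[v] == q_label[u]} for u in q_nodes}
    changed = True
    while changed:
        changed = False
        for u, u2 in q_edges:
            # Keep v only if it has a child that still matches u2.
            keep = {v for v in sim[u] if succ[v] & sim[u2]}
            if keep != sim[u]:
                sim[u] = keep
                changed = True
    if any(not sim[u] for u in q_nodes):
        return None
    return sim

# Toy data in the spirit of pattern Q2 (a book recommended by ST and TE).
q_nodes = ["ST", "TE", "book"]
q_edges = [("ST", "book"), ("TE", "book")]
q_label = {"ST": "ST", "TE": "TE", "book": "book"}
g_nodes = ["st1", "st2", "te1", "book1", "book2"]
g_edges = [("st1", "book1"), ("st2", "book2"), ("te1", "book2")]
g_label = {"st1": "ST", "st2": "ST", "te1": "TE",
           "book1": "book", "book2": "book"}
sim = max_simulation(q_nodes, q_edges, q_label, g_nodes, g_edges, g_label)
```

In this toy graph book1 survives as a match for book although no teacher recommends it, because simulation only looks at children.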

Intuitively, simulation preserves the labels and the child 
relationship of a graph pattern in its match. Simulation was 
proposed for the analyses of programs [29], and studied for 
schema extraction from semi-structured data [2]. Simula- 
tion and its extensions were recently introduced for social 
networks [6], and for graph pattern matching [19, 18] due to 
its low ptime computational complexity [24]. 

To capture graph topology, we extend simulation by en- 
forcing duality, to preserve the parent relationship as well. 

Dual simulation. Pattern graph Q matches data graph G
via dual simulation, denoted by Q ≺_D G, if Q ≺ G with a
binary match relation S ⊆ V_q × V, and moreover, for each
pair (u, v) ∈ S and each edge (u_2, u) in E_q, there exists an
edge (v_2, v) in E with (u_2, v_2) ∈ S.

Intuitively, dual simulation enhances graph simulation by 
imposing an additional condition, to preserve both child and 
parent relationships (downward and upward mappings). 
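
The two conditions can be checked directly on a candidate relation. The sketch below is a hypothetical checker, not part of the paper's algorithms; it applies the child condition to every pair in S, which is the usual reading of the simulation definition, and the names and toy data are assumptions.

```python
def is_dual_simulation(S, q_nodes, q_edges, q_label, g_edges, g_label):
    """Check whether S (pairs of pattern node, data node) is a dual simulation."""
    S = set(S)
    g_edges = set(g_edges)
    matched = {}
    for u, v in S:
        matched.setdefault(u, set()).add(v)
    # Every pattern node needs at least one match.
    if any(u not in matched for u in q_nodes):
        return False
    for u, v in S:
        if q_label[u] != g_label[v]:
            return False
        for a, b in q_edges:
            # Child condition: edge (u, b) must be mirrored by some (v, w).
            if a == u and not any((v, w) in g_edges for w in matched[b]):
                return False
            # Parent condition: edge (a, u) must be mirrored by some (w, v).
            if b == u and not any((w, v) in g_edges for w in matched[a]):
                return False
    return True

q_nodes = ["ST", "TE", "book"]
q_edges = [("ST", "book"), ("TE", "book")]
q_label = {"ST": "ST", "TE": "TE", "book": "book"}
g_edges = [("st1", "book1"), ("st2", "book2"), ("te1", "book2")]
g_label = {"st1": "ST", "st2": "ST", "te1": "TE",
           "book1": "book", "book2": "book"}

sim_only = {("ST", "st1"), ("ST", "st2"), ("TE", "te1"),
            ("book", "book1"), ("book", "book2")}
dual = {("ST", "st2"), ("TE", "te1"), ("book", "book2")}
```

Here `sim_only` satisfies the child condition but fails the parent condition at book1, whereas `dual` satisfies both.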

While there may be multiple matches in a graph G for a
pattern Q, there exists a unique maximum match S_M in G
for Q such that for any match S in G for Q, S ⊆ S_M.

Lemma 1: For any pattern graph Q and data graph G with
Q ≺_D G, there is a unique maximum match relation. □

Locality. To define the locality, we need some notions. 

Balls. For a node v in a graph G and a non-negative integer
r, the ball with center v and radius r is a subgraph of G,
denoted by G[v, r], such that (1) for all nodes v' in G[v, r],
the shortest distance dist(v, v') ≤ r, and (2) it has exactly
the edges that appear in G over the same node set.
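
A ball can be extracted with a breadth-first search that stops at depth r and then keeps the induced edges. The following Python sketch assumes an edge-list representation; the function name `ball` is chosen for this example.

```python
from collections import deque

def ball(nodes, edges, center, radius):
    """Return (node set, edge set) of the ball G[center, radius]."""
    adj = {v: set() for v in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)  # distances ignore edge direction
    dist = {center: 0}
    queue = deque([center])
    while queue:
        x = queue.popleft()
        if dist[x] == radius:
            continue  # do not expand beyond the radius
        for y in adj[x]:
            if y not in dist:
                dist[y] = dist[x] + 1
                queue.append(y)
    ball_nodes = set(dist)
    # Condition (2): exactly the edges of G over the ball's node set.
    ball_edges = {(u, v) for u, v in edges if u in ball_nodes and v in ball_nodes}
    return ball_nodes, ball_edges

path_nodes = ["a", "b", "c", "d"]
path_edges = [("a", "b"), ("b", "c"), ("c", "d")]
print(ball(path_nodes, path_edges, "b", 1))
```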

We define the locality by requiring matches to be within 
a ball of a certain radius. Indeed, as observed in [7], when 
social distance increases, the closeness of relationships de- 
creases and the relationships may become irrelevant. Hence 
it often suffices in practice to consider only those matches of 
a pattern that fall in a small ball. To formalize this, we use 
the notion of match graphs of a pattern, given as follows. 

Match graphs. Consider a relation S ⊆ V_q × V. The match
graph w.r.t. S is a subgraph G[V_s, E_s] of G, in which (1) a
node v ∈ V_s iff it is in S, and (2) an edge (v, v') ∈ E_s iff there
exists an edge (u, u') in Q with (u, v) ∈ S and (u', v') ∈ S.
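
The match graph can be built in one pass over the data edges. The sketch below is illustrative only; the function name and the toy relation are assumptions.

```python
def match_graph(S, q_edges, g_edges):
    """Return (V_s, E_s) of the match graph w.r.t. relation S."""
    pattern_edges = set(q_edges)
    matched_by = {}  # data node -> pattern nodes it matches
    for u, v in S:
        matched_by.setdefault(v, set()).add(u)
    nodes = set(matched_by)
    edges = set()
    for v, v2 in g_edges:
        if v in matched_by and v2 in matched_by:
            # Keep (v, v2) only if it mirrors some pattern edge (u, u2).
            if any((u, u2) in pattern_edges
                   for u in matched_by[v] for u2 in matched_by[v2]):
                edges.add((v, v2))
    return nodes, edges

S = {("A", "a1"), ("B", "b1")}
q_edges = [("A", "B")]
g_edges = [("a1", "b1"), ("b1", "a1"), ("a1", "c1")]
print(match_graph(S, q_edges, g_edges))
```

In this toy input the edge (b1, a1) is dropped because the pattern has no edge from B to A, and (a1, c1) is dropped because c1 is not in S.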
We are now ready to define strong simulation. 

Strong simulation. Pattern graph Q matches data graph
G via strong simulation, denoted by Q ≺_D^L G, if there exist
a node v in G and a connected subgraph G_s of G such that
(1) Q ≺_D G_s, with the maximum match relation S; (2) G_s
is exactly the match graph w.r.t. S, and (3) G_s is contained
in the ball G[v, d_Q], where d_Q is the diameter of Q.

We refer to G_s as a perfect subgraph of G w.r.t. Q.

Intuitively, a match G_s of pattern Q is required to satisfy
the following conditions: (1) it preserves both the child and 
parent relationships of Q (condition 1 above); (2) the nodes 
and edges needed to match Q are all contained in a ball 
of a radius decided by the diameter of Q (conditions 2 and 
3); this rules out excessively large matches. As will be seen 
shortly, these conditions are justified for preserving graph 
topology and retaining low computational complexity. 

Example 2: Consider pattern graph Q1 and data graph G1
of Fig. 1. Observe the following. (1) No subgraph of G1 is
isomorphic to Q1. Indeed, there exists no directed cycle in
G1 that matches the directed cycle DM, AI, DM in Q1.
Figure 2: Strong simulation

(2) When simulation is adopted, the entire data graph
G1 is included in the match relation, which maps HR,
SE, Bio, DM and AI in Q1 to {HR1, HR2}, {SE1, SE2},
{Bio1, Bio2, Bio3, Bio4}, {DM1, DM2, DM'_1, ..., DM'_k} and
{AI1, AI2, AI'_1, ..., AI'_k} in G1, respectively.

(3) When it comes to strong simulation, the connected component
G_c of G1 that contains Bio4 is the only match, which
maps HR, SE, Bio, DM and AI in Q1 to {HR2}, {SE2}, {Bio4},
{DM1, DM2} and {AI1, AI2} in G1, respectively. Indeed, one
can verify the following: (a) Q1 ≺_D G_c, and in its match relation,
Bio in Q1 can only be mapped to Bio4 in G1; and (b)
the ball with center Bio4 and radius 3 (the diameter of Q1)
is exactly G_c. As opposed to simulation, the cycle AI'_1, DM'_1,
..., AI'_k, DM'_k, AI'_1 in G1 is not part of the match. Indeed,
this cycle is irrelevant and thus should be left out.

As another example, consider pattern graphs Q2, Q3, Q4
and data graphs G2, G3, G4 shown in Fig. 2.

(4) Pattern Q2 is to find a book recommended by both students
(ST) and teachers (TE). When simulation is used,
both book1 and book2 in G2 are returned as matches, while
book1 is obviously not a good option. When strong simulation
is adopted, book2 is the only match by the duality, in
a single match graph (union of G_{2,1}, G_{2,2} in Fig. 2). When
it comes to subgraph isomorphism, it returns two match
graphs (G_{2,1}, G_{2,2}) instead of one, with book2 as the match.

(5) Pattern Q3 is to find people (P and P') who recommend
each other. When simulation or dual simulation is used, all
people (P1, P2, P3 and P4) in G3 are found as matches, while
P4 is obviously not a good choice. When strong simulation is
adopted, P1, P2 and P3 are the only matches by the locality,
in a single match graph (union of G_{3,1}, G_{3,2} in Fig. 2). These
are also the matches found via subgraph isomorphism, in two
match graphs (G_{3,1}, G_{3,2}) instead of a single one.

(6) Pattern Q4 is looking for papers on social networks (SN)
cited by papers on databases (db), which in turn cite papers
on graph theory (graph). When simulation is used, all papers
on SN (SN1, SN2, SN3 and SN4) in G4 are matches, while
SN3 and SN4 are obviously excessive matches. When strong
simulation is adopted, SN1 and SN2 are the only matches due
to the duality, returned in a single match graph (union of
G_{4,i,j} with i, j ∈ [1, 2] in Fig. 2). These are also the matches
found by subgraph isomorphism, yet returned in four match
graphs (G_{4,i,j} for i, j ∈ [1, 2]) instead of one. □

Semantics. Strong simulation is to find, given any pattern
graph Q and data graph G, the set of the maximum perfect
subgraphs G_s, one in each ball, such that Q ≺_D G_s.

By Lemma 1, one can verify the following, which assures
that dual simulation and strong simulation are well defined.

Theorem 1: For any pattern graph Q and data graph G
such that Q ≺_D^L G, there exists a unique set of maximum
perfect subgraphs for Q and G. □



Notations        Semantics
G[V_s, E_s]      Subgraph of G with node and edge sets V_s, E_s
G[v, r]          A ball in a graph G with center v and radius r
◁, ≺             Subgraph isomorphism and graph simulation
≺_D, ≺_D^L       Dual simulation and strong simulation

Table 1: Summary of notations

Remark. (1) Duality and locality are also imposed by sub- 
graph isomorphism, but not by simulation. (2) One can 
readily extend strong simulation by supporting bounds on 
the number of hops and regular expressions as edge con- 
straints on pattern graphs, along the same lines as [19, 18]. 
We defer this to the full report to simplify the discussion. 

We summarize notations in Table 1, in which we use Q ◁ G
to denote that Q matches G via subgraph isomorphism.

3. Properties of Strong Simulation 

Below we first identify a set of criteria for topology preser- 
vation in pattern matching and for bounded match results. 
Based on the criteria we then evaluate strong simulation, dual
simulation, subgraph isomorphism and graph simulation.
Finally, we explore possible extensions to strong simulation 
and show that they lead to intractable problems. 

Consider a connected pattern graph Q = (V_q, E_q) with
diameter d_Q and a data graph G = (V, E).

3.1 Fundamental Properties 

First, one can readily verify that subgraph isomorphism 
is a stronger notion than the other three, followed by strong 
simulation, dual simulation and graph simulation in this or- 
der. Intuitively, subgraph isomorphism preserves most topo- 
logical structures between data graphs and pattern graphs. 

Proposition 1: (1) If Q ◁ G, then Q ≺_D^L G; (2) if Q ≺_D^L G,
then Q ≺_D G; and (3) if Q ≺_D G, then Q ≺ G. □

We next take a closer look at what structures are pre- 
served by these matching notions, by giving a set of criteria. 

(1) Children. If a node u in the pattern graph Q matches
node v in the data graph G, then each child of u must match
a child of v. All these notions preserve the child relationship.

(2) Parents. If a node u in Q matches node v in G, then 
each parent of u also matches a parent of v. One can ver- 
ify that subgraph isomorphism, strong simulation and dual 
simulation preserve the parent relationship, but simulation 
does not. A counterexample for simulation is given in Fig. 1. 

(3) Connectivity. A connected pattern graph only matches 
a connected subgraph in the data graph. 

As shown in Example 1, connected Q1 may match a disconnected
data graph G1 via graph simulation. Dual simulation
prevents this, as shown below. From this it follows
that the stronger notions subgraph isomorphism and strong 
simulation also preserve the connectivity. 

Theorem 2: If Q ≺_D G, then for any connected component
G_c of the match graph w.r.t. the maximum match relation
of Q and G, (1) Q ≺_D G_c, and (2) G_c is exactly the match
graph w.r.t. the maximum match relation of Q and G_c. □

Since G_c is a connected component of the match graph
and Q is assumed connected, all "relevant nodes" are in G_c.

matching  children  parents  connectivity  cycles                     locality  matches  bisimilar & b'ed-cycle
≺         ✓         ×        ×             ✓ (directed)               ×         ×        ×
≺_D       ✓         ✓        ✓             ✓ (directed & undirected)  ×         ×        ×
≺_D^L     ✓         ✓        ✓             ✓ (directed & undirected)  ✓         ✓        ×
◁         ✓         ✓        ✓             ✓ (directed & undirected)  ✓         ×        ✓

Table 2: Topology preservation and bounded matches

(4) Cycles. An undirected (resp. directed) cycle in Q must 
match an undirected (resp. directed) cycle in G. 

We show that graph simulation preserves directed cycles, 
and hence so do the other three matching notions. 

Proposition 2: If Q ≺ G and there is a directed cycle in Q,
then there must exist a matched directed cycle in the match
graph w.r.t. any match relation of Q and G. □

However, as shown in Example 1, graph simulation may 
match an undirected cycle in a pattern with a tree in a data 
graph. In contrast, dual simulation (and subgraph isomor- 
phism and strong simulation) preserve undirected cycles. 

Theorem 3: If Q ≺_D G and there is an undirected cycle in
Q, then there must exist a matched undirected cycle in the
match graph w.r.t. any match relation of Q and G. □

(5) Locality. The diameter of a matched subgraph in G must 
be bounded by a function in the size of the pattern graph. 
This allows us to check a match locally, by inspecting only 
its neighborhood of a bounded diameter. 

Strong simulation has the locality property, and so does 
subgraph isomorphism. In contrast, neither simulation nor 
dual simulation has the locality (see Examples 1 and 2). 

Proposition 3: If Q ≺_D^L G, then for all perfect subgraphs
G_s of G, the diameter of G_s is bounded by 2*d_Q, where d_Q
is the diameter of Q. □

(6) Bounded matches. There should be a bounded number 
of matches, and each match is small enough to inspect. As 
remarked earlier, subgraph isomorphism may yield exponen- 
tially many matched subgraphs. While simulation and dual 
simulation return a single match relation, the relation is of- 
ten too large to understand. In contrast, strong simulation 
strikes a balance: the number of matches is bounded by 
the number of nodes in the data graph, and each matched 
subgraph has a bounded diameter (Proposition 3). 

Proposition 4: The number of maximum perfect subgraphs 
of G is bounded by the number of nodes in G. □ 

These results are summarized in Table 2. They tell us that 
strong simulation preserves much more topological struc- 
tures between pattern graphs and data graphs than graph 
simulation, and moreover, possesses the locality property. 

3.2 In Search for Tractable Boundary in Matching 

One might want to find a notion of graph pattern match- 
ing that preserves maximum graph topology, and charac- 
terize ptime along the same lines as how Fagin's theorem 
characterizes NP [30]. This is, however, very challenging. 
Indeed, as observed in [23], in graph theory Fagin's theorem
implies that "if no logic captures ptime, then ptime ≠ NP".



Below we present two negative results: extending strong
simulation moves its computation from ptime to NP-hard.

Bounded cycles. Given a pattern graph Q and a data graph 
G such that Q ≺ G with the maximum match relation S, the
bounded cycle problem is to determine whether the longest 
cycle in the match graph w.r.t. S is bounded by the longest 
one in Q. Obviously bounded cycle is a desirable locality 
property that one would have wanted to further impose on 
strong simulation. Unfortunately, this additional condition 
would make pattern matching intractable. 

Theorem 4: The bounded cycle problem is coNP-hard even 
when pattern graphs contain a single cycle. □ 

Bisimulation. One might be tempted to use graph bisimulation
[29] rather than graph simulation in graph pattern
matching. A pattern graph Q matches a graph G_s via bisimulation,
denoted by Q ~ G_s, if Q ≺ G_s with the maximum
match relation S and G_s ≺ Q with the inverse S^{-1} of S as
its maximum match relation. Pattern matching via bisimulation
is to find all subgraphs G_s of a graph G such that
Q ~ G_s. Clearly bisimulation preserves more topological
structures than simulation. Indeed, it is a notion stronger
than simulation but weaker than isomorphism.

However, pattern matching via bisimulation becomes intractable.
Indeed, subgraph bisimulation is NP-hard [17],
although graph bisimulation is solvable in ptime [29]. In
contrast, subgraph simulation is equivalent to graph simulation,
i.e., checking whether there exists a subgraph G_s of G
such that Q ≺ G_s is the same as checking whether Q ≺ G.

4. An Algorithm for Strong Simulation 

We next show that graph pattern matching via strong sim- 
ulation retains the same complexity as earlier extensions [19, 
18] of simulation, while it is able to preserve graph topology 
better. The main results of this section are as follows. 

Theorem 5: For any pattern graph Q and data graph G, it
takes cubic time to check whether Q ≺_D^L G, and to find the
set of maximum perfect subgraphs of G w.r.t. Q. □

Theorem 6: For any pattern graph Q with diameter d_Q, it
takes quadratic time to find a minimum pattern graph Q_m
such that Q_m and Q find the same result on any data graph
by using d_Q as the radius of balls, via strong simulation. □

We first prove Theorem 5 by providing a cubic-time al- 
gorithm for computing strong simulation. We then show 
Theorem 6 by proposing optimization techniques. Finally, 
we briefly discuss how the locality of strong simulation al- 
lows us to conduct pattern matching on distributed graphs. 

4.1 A Cubic-time Algorithm 

Algorithm. The algorithm, referred to as Match, is shown
in Fig. 3. Given a pattern graph Q and a data graph G, it
returns the set of maximum perfect subgraphs G_s by inspecting
those balls of radius d_Q centered at each node w of G.
To present Match, we first describe its procedures.
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Input: Pattern graph Q with diameter <1q, data graph G(V, E). 
Output: The set of maximum perfect subgraphs of G w.r.t. Q. 

1. := 0; 

2. for each ball G[w, dg] in G do 

3. S w := DualSim(Q,G[«),dQ]); 

4. G s := ExtractMaxPG(Q, G[w, cIq], S w ); 

5. if G s ^ nil then 8 := 0U {G s }; 

6. return 0. 

Procedure DualSim(Q, G[w, d_Q])
Input: Pattern graph Q(V_q, E_q) and ball G[w, d_Q].
Output: The maximum match relation S_w of Q and G[w, d_Q].

1. for each v ∈ V_q do
2.     sim(v) := {u | u is in G[w, d_Q] and l_q(v) = l(u)};
3. while there are changes do
4.     for each edge (v, v') in E_q and each node u ∈ sim(v) do
5.         if there is no edge (u, u') in G[w, d_Q] with u' ∈ sim(v')
6.         then sim(v) := sim(v) \ {u};
7.     for each edge (v', v) in E_q and each node u ∈ sim(v) do
8.         if there is no edge (u', u) in G[w, d_Q] with u' ∈ sim(v')
9.         then sim(v) := sim(v) \ {u};
10.    if sim(v) = ∅ then return ∅;
11. S_w := {(v, u) | v ∈ V_q, u ∈ sim(v)};
12. return S_w.

Procedure ExtractMaxPG(Q, G[w, d_Q], S_w)
Input: Pattern Q, ball G[w, d_Q], maximum match relation S_w.
Output: The maximum perfect subgraph G_s in G[w, d_Q] if any.

1. if w does not appear in S_w then return nil;
2. Construct the matching graph G_m w.r.t. S_w;
3. return the connected component G_s containing w in G_m.

Figure 3: Algorithm Match

Procedure DualSim. It takes as input pattern graph
Q(V_q, E_q) and ball G[w, d_Q] with center w and radius
d_Q, and finds the maximum match relation S_w of Q and
G[w, d_Q]. For each node v in V_q, it first computes the set
sim(v) of candidate matches u in the ball with the same node
label, i.e., l_q(v) = l(u) (lines 1-2). Then the procedure
repeatedly removes nodes from sim(v) for each node v in Q
(lines 3-10). A node u is removed from sim(v) unless (1)
for each parent node v' of v, there exists a parent
node u' of u with u' ∈ sim(v'); and (2) for each child node
v' of v, there exists a child node u' of u with u' ∈ sim(v').
Finally, S_w is returned (lines 11-12).
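
The refinement loop of DualSim can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function name `dual_sim`, the dict-of-labels graph encoding and the use of an empty dict for "no match" are assumptions made here for the example.

```python
from collections import defaultdict

def dual_sim(q_nodes, q_edges, g_nodes, g_edges):
    """Return sim: pattern node -> set of data nodes, or {} if Q has no match.

    q_nodes / g_nodes map node -> label; q_edges / g_edges are lists of
    (source, target) pairs (hypothetical encoding chosen for this sketch).
    """
    succ, pred = defaultdict(set), defaultdict(set)
    for a, b in g_edges:
        succ[a].add(b)
        pred[b].add(a)
    # lines 1-2: candidates are the data nodes carrying the same label
    sim = {v: {u for u, lab in g_nodes.items() if lab == q_nodes[v]}
           for v in q_nodes}
    changed = True
    while changed:                      # lines 3-10: refine until stable
        changed = False
        for v, v2 in q_edges:
            # child condition: u in sim(v) needs a successor in sim(v2)
            keep = {u for u in sim[v] if succ[u] & sim[v2]}
            if keep != sim[v]:
                sim[v], changed = keep, True
            # parent condition: u in sim(v2) needs a predecessor in sim(v)
            keep = {u for u in sim[v2] if pred[u] & sim[v]}
            if keep != sim[v2]:
                sim[v2], changed = keep, True
        if any(not s for s in sim.values()):
            return {}                   # some pattern node lost all matches
    return sim
```

Each pass only ever shrinks the candidate sets, so the loop terminates once no pattern edge removes a candidate.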

Procedure ExtractMaxPG. It takes as input a pattern graph
Q, ball G[w, d_Q], and the maximum match relation S_w. It
finds the perfect subgraph G_s in the ball if there exists one.
By Theorem 2, the procedure simply finds the connected
component containing w in the match graph w.r.t. S_w.
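
As a rough sketch of ExtractMaxPG, the snippet below builds the match graph from a candidate map and returns the component containing the ball center. The name `extract_max_pg`, the `sim` dictionary format (pattern node to set of data nodes) and the `None` return value standing in for nil are assumptions for illustration; connectivity is taken to ignore edge direction.

```python
from collections import deque

def extract_max_pg(q_edges, g_edges, sim, w):
    """Return (nodes, edges) of the match-graph component containing w, or None."""
    matched = set().union(*sim.values()) if sim else set()
    if w not in matched:                 # line 1: the center is not matched
        return None
    # line 2: a data edge belongs to the match graph iff it realises
    # some pattern edge (v, v2) under the match relation
    m_edges = {(u, u2) for u, u2 in g_edges
               if any(u in sim[v] and u2 in sim[v2] for v, v2 in q_edges)}
    nbrs = {u: set() for u in matched}
    for u, u2 in m_edges:
        nbrs[u].add(u2)
        nbrs[u2].add(u)
    # line 3: breadth-first search for the component of w
    seen, queue = {w}, deque([w])
    while queue:
        x = queue.popleft()
        for y in nbrs[x] - seen:
            seen.add(y)
            queue.append(y)
    return seen, {(a, b) for a, b in m_edges if a in seen}
```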

Algorithm Match. We are now ready to present Match. For
each node w in the data graph G, (1) it computes the maximum
match relation S_w of Q and the ball G[w, d_Q] by invoking
DualSim (line 3); (2) it finds the perfect subgraph G_s
in G[w, d_Q] via ExtractMaxPG (line 4); and (3) G_s is added
to the result set Θ if it exists (line 5). After all balls in G are
checked, it returns the set of perfect subgraphs (line 6).
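
The two ingredients Match needs before it can call its procedures, the pattern diameter d_Q and the ball around each node, can be sketched with a breadth-first search. The helper names below are hypothetical, distances are assumed to ignore edge direction, and the pattern is assumed to be connected.

```python
from collections import deque

def undirected_adj(edges):
    """Adjacency sets that ignore edge direction."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj

def bfs_dist(adj, src, limit=None):
    """Hop distances from src; nodes at `limit` hops are not expanded further."""
    dist, queue = {src: 0}, deque([src])
    while queue:
        x = queue.popleft()
        if limit is not None and dist[x] == limit:
            continue
        for y in adj.get(x, ()):
            if y not in dist:
                dist[y] = dist[x] + 1
                queue.append(y)
    return dist

def diameter(nodes, edges):
    """Longest shortest distance between two nodes (connected graph assumed)."""
    adj = undirected_adj(edges)
    return max(max(bfs_dist(adj, v).values()) for v in nodes)

def ball(edges, w, r):
    """Nodes within r hops of w, and the edges of G induced on them."""
    inside = set(bfs_dist(undirected_adj(edges), w, r))
    return inside, [(a, b) for a, b in edges if a in inside and b in inside]
```

With these helpers, a driver would compute `diameter` once for the pattern and then call `ball(g_edges, w, d_q)` for every data node w before running the two procedures on it.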

Example 3: Consider pattern graph Q_1 (d_{Q_1} = 3) and
the ball with center Bio_4 and radius 3 in data graph G_1 of
Fig. 1. Note that the ball is exactly the connected component
G_c with node Bio_4 in G_1. We show how Algorithm Match
works on Q_1 and G_c. Initially, HR, SE, Bio, AI and DM in Q_1
match {HR_2}, {SE_2}, {Bio_4}, {AI_1, AI_2} and {DM_1, DM_2} in
G_c, respectively (lines 1-2, DualSim). The algorithm finds
no nodes to be removed from sim(v) for all nodes v in Q_1 in
this case (lines 3-10, DualSim). Hence Match returns G_c as
the perfect subgraph in the ball (line 6, Match). □

Input: Pattern graph Q = (V_q, E_q, l_q).
Output: A minimized equivalent pattern graph Q_m of Q.

1. Compute the maximum match relation S of Q ≺_D Q;
2. Compute equivalence classes of nodes in Q based on S;
3. Create a node for each equivalence class in Q_m;
4. Connect different equivalence classes by necessary edges in Q_m;
5. return Q_m.

Figure 4: Algorithm minQ

Correctness & Complexity. The correctness of the algorithm is
assured by the following. (1) There is at most one perfect
subgraph in each ball of G (Theorem 1). (2) Procedure
ExtractMaxPG returns the perfect subgraph in ball G[v, d_Q], by
Theorem 2. (3) The correctness of DualSim can be verified
along the same lines as its counterpart for simulation [24].

It takes O(|V| + |E|) time to build a ball G[v, d_Q]
by using the BFS method [16]. For each ball, ExtractMaxPG
finds its perfect subgraph in O(|V|) time since finding pairwise
disconnected components is linear-time equivalent to
finding strongly connected components, which is in linear
time [13]. By leveraging the algorithm developed in [24],
DualSim can be done in O((|V_q| + |E_q|)(|V| + |E|)) time.
Thus Match is in O(|V|(|V| + (|V_q| + |E_q|)(|V| + |E|))) time.
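
To see why this bound counts as cubic, one can measure both inputs by their total size. The short derivation below does so; the shorthands n and m are introduced here only for illustration and are not the paper's notation.

```latex
\begin{align*}
n &= |V| + |E|, \qquad m = |V_q| + |E_q| \quad (m \ge 1),\\
|V|\bigl(|V| + (|V_q| + |E_q|)(|V| + |E|)\bigr)
  &\le n\,(n + m\,n) = n^2\,(1 + m) \le 2\,m\,n^2,\\
2\,m\,n^2 &\le 2\,(m + n)^3 .
\end{align*}
```

The last step holds because the expansion of (m + n)^3 contains the term 3mn^2, so the running time is cubic in the combined size of Q and G.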

4.2 Optimization Techniques 

We next present optimization techniques for algorithm 
Match, by means of query minimization, dual simulation 
filtering and connectivity pruning. 

Query minimization. We first explore query minimiza- 
tion, which is important for any query language [3]. 

We say that two pattern graphs Q and Q' are equivalent,
denoted by Q ≡ Q', iff they return the same result on any
data graph. A pattern graph Q is minimum if it has the
least size |Q| (the number of nodes and the number of edges)
among all equivalent pattern graphs.

By the complexity analysis of algorithm Match, smaller 
pattern graphs lead to better performance. 

Theorem 6 follows from Lemmas 2 and 3 given below. 

Lemma 2: For any pattern graph, (1) there exists a unique 
(up to isomorphism) minimum equivalent pattern graph, via 
dual simulation, that finds the same maximum match rela- 
tion on any data graph; and (2) there exists a quadratic time 
algorithm to find its minimum equivalent query. □ 

Lemma 3: When fixing the radius of balls in strong simula- 
tion, two pattern graphs are equivalent via strong simulation 
iff they are equivalent via dual simulation. □ 

Leveraging these, Algorithm Match can be improved as
follows. Given query graph Q, we first compute its minimum
equivalent query graph Q_m, and then we compute strong
simulation w.r.t. Q_m and diameter d_Q.

Algorithm. As a proof of Lemma 2, we present Algorithm
minQ for minimizing graph patterns, shown in Fig. 4. It
takes as input a pattern graph Q, and returns a minimum
equivalent pattern Q_m of Q, via dual simulation. For any
pattern graph Q, it first computes the maximum match relation
S by treating Q as both a pattern graph and a data
graph (line 1). It then computes equivalence classes for nodes
in Q such that nodes u and v are in the same class iff both
(u, v) ∈ S and (v, u) ∈ S (line 2). Finally, it constructs the
minimum equivalent query Q_m as follows (lines 3-4). (a)
For each equivalence class eq, it creates a node eq for Q_m,
and (b) there is an edge (eq, eq') in Q_m iff there exist nodes
u ∈ eq and u' ∈ eq' such that there is an edge (u, u') in Q.

Input: Pattern Q, relation S w.r.t. Q ≺_D G, ball G[w, d_Q].
Output: The maximum perfect subgraph of G[w, d_Q] w.r.t. Q.

1. S_w := project S onto G[w, d_Q]; filterSet := ∅;
2. for each (u, v) ∈ S_w such that v is a border node do
3.     if succ(v) ∩ sim(u_1) = ∅ or pred(v) ∩ sim(u_2) = ∅
4.     such that (u, u_1) ∈ E_q, (u_2, u) ∈ E_q
5.     then filterSet.push((u, v));
6. while (filterSet ≠ ∅) do
7.     (u, v) := filterSet.pop(); S_w := S_w \ {(u, v)};
8.     for each (u_2, u) ∈ E_q do
9.         for each v_2 ∈ pred(v) ∩ sim(u_2) do
10.            if succ(v_2) ∩ sim(u) = ∅ then
11.                filterSet.push((u_2, v_2));
12.    for each (u, u_1) ∈ E_q do
13.        for each v_1 ∈ succ(v) ∩ sim(u_1) do
14.            if pred(v_1) ∩ sim(u) = ∅ then
15.                filterSet.push((u_1, v_1));
16. if there exists u in Q such that sim(u) = ∅ then S_w := ∅;
17. return ExtractMaxPG(Q, G[w, d_Q], S_w).

Figure 5: Algorithm dualFilter

Example 4: Taking as input the pattern graph Q_5 given
in Fig. 6(a), Algorithm minQ works as follows. (1) It first
computes the maximum match relation S of Q_5 and Q_5, via
dual simulation, yielding S = {(R, R), (A, A), (B_i, B_j), (C_i, C_j),
(D_i, D_j)} (i, j ∈ [1, 2]). (2) It then computes five equivalence
classes: eq_R = {R}, eq_A = {A}, eq_B = {B_1, B_2}, eq_C =
{C_1, C_2}, and eq_D = {D_1, D_2}. (3) Finally, it constructs
the minimum pattern graph Q_{5,m} of Q_5, shown in Fig. 6(a):
(a) For each equivalence class eq_x, where x ∈ {R, A, B, C, D},
it creates a node labeled with x; and (b) it creates an edge
from node x to y in Q_{5,m} iff there exist nodes u ∈ eq_x and
v ∈ eq_y such that (u, v) is an edge in Q_5. □
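
The steps of minQ can be sketched as below. This is an illustrative reading of Fig. 4 rather than the authors' code: the function names, the label dictionary encoding and the use of frozensets as class identifiers are assumptions, and the small pattern in the usage note is made up (it is not Q_5).

```python
from collections import defaultdict

def dual_sim(q_labels, q_edges, g_labels, g_edges):
    """Maximum dual simulation of a pattern on a data graph (no early exit)."""
    succ, pred = defaultdict(set), defaultdict(set)
    for a, b in g_edges:
        succ[a].add(b)
        pred[b].add(a)
    sim = {v: {u for u in g_labels if g_labels[u] == q_labels[v]}
           for v in q_labels}
    changed = True
    while changed:
        changed = False
        for v, v2 in q_edges:
            keep = {u for u in sim[v] if succ[u] & sim[v2]}
            if keep != sim[v]:
                sim[v], changed = keep, True
            keep = {u for u in sim[v2] if pred[u] & sim[v]}
            if keep != sim[v2]:
                sim[v2], changed = keep, True
    return sim

def min_q(labels, edges):
    """Return (nodes, edges) of the minimized pattern; nodes map class -> label."""
    sim = dual_sim(labels, edges, labels, edges)          # line 1: Q against itself
    # line 2: u and v are equivalent iff they simulate each other
    cls = {u: frozenset(v for v in sim[u] if u in sim[v]) for u in labels}
    # line 3: one node per equivalence class, keeping the shared label
    m_nodes = {c: labels[next(iter(c))] for c in set(cls.values())}
    # line 4: an edge between classes iff some members are connected in Q
    m_edges = {(cls[a], cls[b]) for a, b in edges}
    return m_nodes, m_edges
```

For instance, a pattern with a root R pointing to two B nodes, each pointing to its own C node, collapses to a three-node chain R, B, C.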

Correctness & Complexity. The correctness of Algorithm
minQ is assured by the following. (1) For any data graph G,
the match graph w.r.t. the maximum match relation S of Q
and G is always the same as the match graph w.r.t. the
maximum match relation S_m of Q_m and G. Hence, Q ≡ Q_m.
(2) |Q_m| ≤ |Q'| for any Q' such that Q' ≡ Q. (3) For any
two minimum equivalent pattern graphs Q_m and Q'_m, there
is a bijective mapping f from Q_m to Q'_m such that (a) for any
node u in Q_m, f(u) is a node in Q'_m with the same label,
and (b) (u, v) is an edge in Q_m iff (f(u), f(v)) is an edge in
Q'_m, i.e., Q_m and Q'_m are equivalent up to isomorphism.

Algorithm minQ is in O((|V_q| + |E_q|)^2) time. Indeed, steps
(1), (2) and (3) of minQ take O((|V_q| + |E_q|)^2) time, O(|V_q|^2)
time and O(|E_q|) time, respectively.

Dual simulation filtering. Our second optimization technique
aims to avoid redundant checking of balls in the data
graph. Most algorithms of graph simulation (e.g., [24])
recursively refine the match relation by identifying and removing
false matches. As observed in [19], it is much easier
to deal with node or edge deletions than their insertions. In
light of this, we compute the match relation of dual simulation
first, and then project the match relation on each ball
to compute strong simulation. This both reduces the initial
match set sim(v) for each node v in Q (line 2, DualSim), and
reduces the number of balls (line 2, Match). Indeed, if a
node v in G does not match any node in Q, then there is no
need to consider the ball centered at v.

Consider border nodes in a ball G[v, r], i.e., nodes u with
dist(v, u) = r. We refer to those nodes reachable from a
border node as affected nodes. One can verify the following:

Proposition 5: The removal process on a ball only needs 
to deal with its border nodes and their affected nodes. □ 

This suggests an order to process nodes in G[v, r]: starting 
from its border nodes, we inspect affected nodes only follow- 
ing an increasing order based on their distances from border 
nodes. This minimizes unnecessary computation. Note that 
the border nodes can be marked when constructing balls. 
Hence this incurs little extra complexity. 

Algorithm. To do this, we first compute the match relation
S_G, via dual simulation, over the entire data graph by invoking
Procedure DualSim in Fig. 3. We then project S_G
onto each ball. When computing the perfect subgraph on
a ball, we start with the border nodes, and identify invalid
matches using Algorithm dualFilter shown in Fig. 5.

We next present Algorithm dualFilter. It takes as input
pattern graph Q, the maximum match relation S (i.e., S_G) of Q
and G that is found via dual simulation, and ball G[w, d_Q]. It
returns the maximum perfect subgraph of G[w, d_Q] w.r.t. Q.
More specifically, dualFilter first projects match relation S_G
onto ball G[w, d_Q], yielding relation S_w (line 1). It then
iteratively marks and removes those invalid matches stored in
a queue filterSet (lines 2-15), initially empty. To do this, it
first inspects those matches in S_w that contain a border
node, to find invalid matches (lines 2-4). The invalid matches
found are stored in filterSet (line 5). It then processes those
marked invalid matches one by one (lines 6-15). Each invalid
match (u, v) with affected node v is removed from
S_w (line 7). The relation S_w is then processed along the
same lines as Algorithm Match (lines 8-15), but following
the order of invalid matches in filterSet. Finally, the algorithm
extracts the perfect subgraph by invoking Procedure
ExtractMaxPG, and returns the subgraph (line 17).
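
The worklist at the heart of dualFilter might look as follows in Python. This sketch stops before the final call to ExtractMaxPG and returns the filtered candidate map instead. The function name, its parameters and the convention that v denotes a pattern node and u a data node (the reverse of Fig. 5) are choices made for this illustration; the set of border nodes is assumed to have been marked while the ball was built.

```python
from collections import defaultdict

def dual_filter(q_edges, ball_nodes, ball_edges, border, sim_g):
    """Filter the global relation sim_g inside one ball; {} means no match."""
    succ, pred = defaultdict(set), defaultdict(set)
    for a, b in ball_edges:
        succ[a].add(b)
        pred[b].add(a)
    q_out, q_in = defaultdict(list), defaultdict(list)
    for v, v2 in q_edges:
        q_out[v].append(v2)
        q_in[v2].append(v)
    # line 1: project the global match relation onto the ball
    sim = {v: set(us) & ball_nodes for v, us in sim_g.items()}

    def invalid(v, u):
        # u lacks a matching child or a matching parent inside the ball
        return (any(not (succ[u] & sim[c]) for c in q_out[v]) or
                any(not (pred[u] & sim[p]) for p in q_in[v]))

    # lines 2-5: only matches on border nodes can be invalid initially
    work = [(v, u) for v in sim for u in sim[v]
            if u in border and invalid(v, u)]
    while work:                                   # lines 6-15
        v, u = work.pop()
        if u not in sim[v]:
            continue                              # already removed
        sim[v].discard(u)
        for p in q_in[v]:                         # parents that relied on u
            for up in pred[u] & sim[p]:
                if not (succ[up] & sim[v]):
                    work.append((p, up))
        for c in q_out[v]:                        # children that relied on u
            for uc in succ[u] & sim[c]:
                if not (pred[uc] & sim[v]):
                    work.append((c, uc))
    # line 16: a pattern node without candidates means the ball has no match
    return {} if any(not s for s in sim.values()) else sim
```

Because removals only propagate from border nodes to their neighbours, balls whose border matches are all valid are left untouched, which is the saving the filtering technique aims for.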

Example 5: We next illustrate how the filtering technique
improves the performance of Algorithm Match by considering
pattern graph Q_6 and data graph G_6 given in Fig. 6(b).
The maximum match relation S_{G_6} of Q_6 and G_6, via dual
simulation, is the set of matches (node pairs) {(A, A_2),
(A, A_3), (B, B_2), (B, B_3), (C, C)}. Hence initially, sim(A)
= {A_2, A_3}, sim(B) = {B_2, B_3} and sim(C) = {C}.

The filtering method then projects the match relation S_{G_6}
on each ball, and checks the results. It finds the following.

(i) There exist invalid matches in two balls: G_6[A_1, 3] and
G_6[B_1, 3], by inspecting their border nodes. For G_6[A_1, 3],
after projecting S_{G_6} on G_6[A_1, 3], we get S_w = {sim(A) =
sim(A') = {A_2}, sim(B) = sim(B') = {B_2}}. Here B_2 is a
border node of G_6[A_1, 3]. Starting with B_2, dualFilter finds
that there exist invalid matches; similarly for G_6[B_1, 3].

(ii) In contrast, there exist no invalid matches in balls
G_6[A_2, 3], G_6[B_2, 3], G_6[A_3, 3], G_6[B_3, 3] and G_6[C, 3].
This is found by inspecting border nodes in each ball. Hence
the final match relation in any of these balls is exactly the
same as the initial projection of S_{G_6} on the ball.

As a result, only two balls (G_6[A_1, 3] and G_6[B_1, 3]) are
really processed by dualFilter, while no more processing is
needed for the other five balls. That is, the filtering method
prunes unnecessary processing and speeds up Match. □

Correctness & Complexity. The correctness of Algorithm
dualFilter is assured by Proposition 5. For its complexity,
observe that it takes O(|V|(|V| + |E|)) time to construct
all balls, and O((|V_q| + |E_q|)(|V| + |E|)) time to compute
the maximum match relation of Q and G via dual simulation,
along the same lines as Algorithm Match. For each
[...] time. Putting these together, dualFilter takes
O(|V|(|V| + (|V_q| + |E_q|)(|V| + |E|))) time in total. Although
the worst case complexity is the same as the complexity of
Match (shown in Fig. 3), as demonstrated by the example
and as will be shown by our experimental study, the optimization
technique is indeed effective in practice.

Connectivity pruning. Theorem 2 tells us that in a ball
G[v, r], only the connected component containing the ball
center v needs to be considered. Hence, those nodes not
reachable from v can be pruned early. Our last main optimization
technique does precisely this. It reduces the search
space for checking dual simulation, and can be easily incorporated
into Algorithm Match, as illustrated below.

Example 6: Consider pattern graph Q_7 and data graph G_7
shown in Fig. 6(c), in which diameters d_{Q_7} = 5 and d_{G_7} = 4.
As d_{Q_7} > d_{G_7}, a ball with any center node of G_7 is exactly
G_7 itself. When conducting dual simulation of Q_7 on ball
G_7[A_1, 5], for instance, the pruning method first finds an
initial sim(v) set for each node v in Q_7, by mapping A_i in
Q_7 to A_j in G_7[A_1, 5] (i ∈ [1, 3], j ∈ [1, 2]). This yields two
connected components in G_7[A_1, 5]: SC_1 containing nodes
{A_1, B_1} and SC_2 containing {A_2, B_2}, in which only SC_1
contains the center node A_1 (recall the notion of connected
graphs from Section 2). By Theorem 2, the pruning method
safely removes all those nodes that are not in SC_1 from
sim(u), for any node u ∈ Q_7, without affecting the final
matches. That is, it prunes invalid matches early and hence,
improves the performance of algorithm Match. □
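
A minimal sketch of the pruning step is given below, assuming candidates are held in a `sim` dictionary as in the earlier procedures. The function name is hypothetical, and connectivity is evaluated on the subgraph induced by the candidate nodes while ignoring edge direction, which is one plausible reading of the example.

```python
from collections import deque

def prune_by_connectivity(ball_edges, sim, center):
    """Keep only candidates connected to the ball center via candidate nodes."""
    cand = set().union(*sim.values())
    if center not in cand:
        # the center matches nothing, so this ball cannot yield a match
        return {v: set() for v in sim}
    adj = {u: set() for u in cand}
    for a, b in ball_edges:
        if a in cand and b in cand:
            adj[a].add(b)
            adj[b].add(a)
    reach, queue = {center}, deque([center])
    while queue:
        x = queue.popleft()
        for y in adj[x] - reach:
            reach.add(y)
            queue.append(y)
    return {v: us & reach for v, us in sim.items()}
```

On a toy input shaped like the example, with components {A1, B1} and {A2, B2} and center A1, the candidates A2 and B2 are dropped before dual simulation starts.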

We have implemented a version of Match that supports all
optimization strategies, referred to as Match+. As will be
seen in Section 5, Match+ significantly outperforms Match.

4.3 Strong Simulation on Distributed Graphs 

When evaluating a query on a large dataset, one wants to 
partition the data and distribute its fragments to multiple 
machines, such that the query can be evaluated in paral- 
lel, as advocated by, e.g., MapReduce [15]. Moreover, it is 
common to find real-life datasets already partitioned and 
distributed. For instance, to find the complete information
of a person, one may have to query several social networks
(e.g., Facebook, Picasa and YouTube) to collect her data.
These highlight the need for developing distributed algorithms
for evaluating graph queries. However, as observed
in [28], graph algorithms often exhibit poor data locality and
hence, may incur prohibitive overhead on network traffic.



We next show that strong simulation demonstrates data 
locality and hence, allows efficient distributed evaluation. 

Data locality. Consider a graph G that is partitioned into
(G_1, ..., G_k) such that each G_i is stored in site M_i for
i ∈ [1, k]. We want to evaluate a pattern query Q on G,
while minimizing unnecessary data shipment from one site
to another. This is, however, rather challenging when pattern
matching is defined in terms of graph simulation.

Example 7: Consider again query graph Q_1 and data graph
G_1 of Fig. 1. Let G_s be the subgraph of G_1 obtained by removing
the connected component with Bio_4 from G_1. Suppose that
G_s is fragmented and distributed. Then to decide whether
Q_1 ≺ G_s, we have to ship all subgraphs of G_s to a single site
to re-assemble G_s. Indeed, (1) the match graph of Q_1 and
G_s via graph simulation is the entire G_s; and (2) removing
any node or edge from G_s makes Q_1 ⊀ G_s. This tells us
that it is hard to conduct graph simulation in the distributed
setting without incurring high network traffic. □

In contrast, we show that strong simulation has data
locality. Indeed, strong simulation can be computed in the
distributed setting, guaranteeing that the total data shipment
is bounded by the set of balls G[v, d_Q] in G such that
v is in some G_i but it has a direct neighbor node not in G_i.

Algorithm. To verify the data locality of strong simulation, 
we outline a distributed algorithm for strong simulation. 

When a site, referred to as the coordinator, receives a
pattern graph Q, it sends the same Q to each site M_i for
i ∈ [1, k]. When a site M_i receives Q, it finds those balls
G[v, d_Q], where v is in G_i but v has a neighbor node in another
fragment G_j. For such nodes v, M_i sends G[v, d_Q] to
site M_j only if j < i. It then invokes algorithm Match to
compute the matches of Q in G_i, as a partial result Θ_i of Q
in G, at site M_i. It sends Θ_i back to the coordinator.

The coordinator monitors messages sent back from all
sites M_i for i ∈ [1, k]. When partial results are returned
from all the sites, the coordinator assembles their partial
results via union, and returns the final result.

One can readily verify that the algorithm is correct, with 
the bound on network traffic mentioned above. Further- 
more, it is generic: it is applicable to any G regardless of 
how G is partitioned and distributed. 
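
The nodes whose balls may need to be shipped are exactly those with a neighbor in another fragment. The sketch below identifies them per site; the function name and the `site_of` mapping (node to site identifier) are assumptions introduced for illustration and do not come from the paper.

```python
def boundary_centers(edges, site_of):
    """Map each site to its nodes that have a neighbor stored at another site."""
    out = {}
    for a, b in edges:
        if site_of[a] != site_of[b]:
            # both endpoints of a cross-fragment edge are boundary nodes
            out.setdefault(site_of[a], set()).add(a)
            out.setdefault(site_of[b], set()).add(b)
    return out
```

Only balls centered at these nodes can cross fragment borders, which is why the data shipment stays bounded regardless of how G is partitioned.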

5. Experimental Study 

We next present an experimental study of strong simu- 
lation. Using both real-life social networks and synthetic 
data, we conducted two sets of experiments to evaluate: (1) 
the effectiveness of strong simulation vs. conventional sub- 
graph isomorphism [34] and graph simulation [24], and (2) 
the efficiency of our centralized algorithm Match. 

Experimental setting. We used the following datasets.

Real-life graph data. We used two real-life network datasets.

(1) Amazon records a product co-purchasing network with
548,552 product nodes and 1,788,725 product-product directed
edges¹. An edge from product x to y indicates that
people buy y with high probability when they buy x.

(2) YouTube provides a video network with 155,513 video
nodes and 3,110,120 video-video directed edges². An edge
from video x to y indicates that if one watches x, then one
is also very likely to watch y.

Synthetic graph generator. We adopted the graph-tool
library³ to produce both pattern and data graphs. It is
controlled by three parameters: the number n of nodes, the
number n^α of edges, and the number l of node labels. Given
n, α, and l, the generator produces a graph with n nodes,
n^α edges, and the nodes are labeled from a set of l labels.
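
To make the three parameters concrete, here is a small stand-in generator written with the Python standard library only. It does not use graph-tool and is not the generator used in the experiments; the uniform choice of edges and labels, the `seed` parameter and the function name are assumptions made for this sketch.

```python
import random

def synthetic_graph(n, alpha, l, seed=0):
    """Random directed graph with n nodes, about n**alpha edges and l labels."""
    rng = random.Random(seed)
    labels = {v: rng.randrange(l) for v in range(n)}
    # cap the edge count at the number of possible directed edges
    target = min(int(n ** alpha), n * (n - 1))
    edges = set()
    while len(edges) < target:
        a, b = rng.randrange(n), rng.randrange(n)
        if a != b:                      # no self-loops in this sketch
            edges.add((a, b))
    return labels, edges
```

With n = 100 and α = 1.2, for example, the generator produces int(100 ** 1.2) distinct directed edges.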

Algorithms. We implemented the following algorithms, all
in Python: (1) our algorithms Match and Match+, (2)
the graph simulation algorithm of [24], denoted by Sim, (3)
the approximate matching algorithm TALE of [32], and (4)
an approximate matching algorithm, denoted by MCS, that
utilizes the approximation algorithm of [25] for computing
maximum common subgraphs. We used the VF2 algorithm
[12] for subgraph isomorphism in the igraph package [14].

Consider pattern graph Q(V_q, E_q) and data graph G(V,
E). For approximate matching algorithms TALE and MCS,
there are essentially 2^|V| subgraphs of G to compare
with Q, beyond reach in practice. Hence, we chose to
compare the subgraphs of G having the same number of nodes
as Q. We adopted the same setting as [32] for TALE
here. For MCS, a subgraph G_s(V_s, E_s) of G matches pattern
graph Q if |mcs(Q, G_s)| / max(|V_q|, |V_s|) ≥ 0.7, where |mcs(Q, G_s)| is
the number of nodes in the maximum common subgraph
mcs(Q, G_s) of Q and G_s computed via the algorithm of [25].

The experiments were run on a cluster of 30 machines, all 
with Intel Core i7 860 CPU and 16GB of memory. Each test 
was repeated over 5 times, and the average is reported here. 

Experimental results. We next present our findings. In
all the experiments, we fixed l = 200, and set α = 1.2 by
default when generating pattern and data graphs.

Exp-1: Quality of matches. In the first set of experi- 
ments, we evaluated the quality of matches found by strong 
simulation vs. matches found by subgraph isomorphism and 
simulation. We measured the quality of matches as follows. 
(1) We first designed pattern graphs, and manually checked 
the quality of matches returned by Match, VF2 and Sim. 
We find that Match is able to identify sensible matches. We 
illustrate this with two example pattern graphs. 

Two real-life pattern graphs Q_A and Q_Y are shown in
Figures 7(a) and 7(b), respectively. Pattern graph Q_A is to
find all "Parenting & Families" books in Amazon network
data (a) that are co-purchased with both "Children's Books"
and "Home & Garden" books; and (b) that are co-purchased
with "Health, Mind & Body" books and vice versa.

Pattern graph Q_Y poses a request on YouTube network
data searching for all "Entertainment" videos (a) that are
related to "Film & Animation" videos and "Music" videos;
and further, (b) for each such "Entertainment" video x,
there is another "Sports" video that is related to the "Film
& Animation" and "Music" videos to which x is related.

¹ http://snap.stanford.edu/data/index.html
² http://netsg.cs.sfu.ca/youtubedata/
³ http://projects.skewed.de/graph-tool/

In data graphs G_A and G_Y, nodes are books and videos,
respectively, labeled with their ids, and they only match the
nodes of Q_A and Q_Y with the same geometric shapes, e.g.,
circles, ellipses, and regular squares and pentagons.

The match results of Q_A and Q_Y are shown in Figures 7(a)
and 7(b), respectively. For pattern graph Q_A, subgraph G_A
is a sensible match found by Match, but it was not found by
VF2. Subgraph G'_A is a match found by Sim in which the
"Parenting & Families" books are not co-purchased with
both "Children's Books" and "Home & Garden" books, among
other things, and was successfully filtered by Match
and VF2. These tell us that strong simulation is able to
identify sensible matches that subgraph isomorphism fails
to catch, and moreover, to eliminate excessive matches by
graph simulation that do not make sense.

For pattern graph Q_Y, subgraph G_Y is a match found
by Match, while subgraphs G_{Y,1}, G_{Y,2} and G_{Y,3} are three
matches found by VF2. This example shows how strong
simulation reduces the sizes of matches found by subgraph
isomorphism, without loss of information.

(2) To further measure the quality of matches found, we use:

    closeness = #matches_subIso / #matches_found,

where #matches_subIso and #matches_found are the total
numbers of nodes in matches found by VF2 and those by 
a comparative algorithm (Sim, Match, VF2, TALE, MCS), 
respectively. Recall that matches found by VF2 are also 
matches found by Match and Sim, by Proposition 1. Hence 
closeness is essentially the ratio of matched nodes found by 
VF2 to the entire matches found by Sim, Match, VF2, TALE 
or MCS. Note that for VF2, closeness is always 1. 
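
The measure is simple enough to state as code. The sketch below assumes each match is given as a set of data nodes and that "total number of nodes" means the number of distinct nodes across all matches; both are interpretations made here, and the function name is hypothetical.

```python
def closeness(subiso_matches, found_matches):
    """Ratio of nodes matched by subgraph isomorphism to nodes matched overall."""
    n_sub = len(set().union(*subiso_matches))      # distinct nodes found by VF2
    n_found = len(set().union(*found_matches))     # distinct nodes found by the other algorithm
    return n_sub / n_found if n_found else 0.0
```

Passing the VF2 matches for both arguments returns 1.0, in line with the remark that the closeness of VF2 is always 1.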

(i) To evaluate the impact of pattern graphs Q, we fixed |V|,
e.g., Amazon with 31245 nodes, YouTube with 9368 nodes,
and synthetic data with 5 × 10^4 nodes, respectively, while
varying |V_q| from 2 to 20.

(ii) To evaluate the impact of data graphs G, we fixed pattern
graphs Q with |V_q| = 10 and varied the size of data
graphs. We varied |V| from 3 × 10^3 to 3 × 10^4 nodes for Amazon
and from 10^3 to 10^4 for YouTube. For synthetic data,
we varied |V| from 10^4 to 10^5. We used relatively smaller
data graphs since VF2 does not scale to large graphs.

The closeness results are reported in Figures 7(c), 7(d), 
7(e), 7(f), 7(g) and 7(h). We can see that the closeness of 
Match is consistently in the range of [70%, 80%] with various 
query and data graphs, while Sim is in [25%, 38%], TALE 
is in [35%, 42%], and MCS is in [46%, 57%], respectively. 
Hence, Match does much better than Sim (up to 50%), TALE 
(up to 36%) and MCS (up to 23%) at preserving graph topol- 
ogy. Indeed, 70% to 80% of the matches found by Match are 
exactly those found by VF2, which enforces strict topological 
matching. Recall that Match is able to find sensible matches 
missed by VF2 (Examples 1 and 2 and the above quality test 
(1)). That is, the [20%, 30%] matches found by Match, but 
missed by VF2, further contain sensible matches. 

In addition, the results tell us that the match quality of
Match, Sim, TALE and MCS is not sensitive to the size
of pattern and data graphs for both real-life and synthetic
data, a desirable property when match quality is concerned.

(3) In the same setting as (2) above for testing closeness, we
tested the numbers of the matched subgraphs in data graphs
returned by Match, VF2, TALE and MCS. We did not report
Sim since it always returns at most one matched subgraph.

Figure 7: Match quality evaluation on real-life data

The results are reported in Figures 7(i), 7(j), 7(k), 7(l),
7(m) and 7(n). They tell us that Match returns far fewer
matched subgraphs than VF2: it consistently returns around
25% to 38% of the matched subgraphs of VF2, for synthetic graphs,
Amazon and YouTube alike. The approximate matching
algorithms TALE and MCS return
even more subgraphs than VF2. Indeed, as shown in
Fig. 7(n), for example, Match returns 2144 matched subgraphs
compared to 4792 by VF2, 5843 by MCS and 7328
by TALE, on a synthetic data graph with 10^5 nodes. This
confirms that Match effectively reduces the sizes of match
results, and hence, allows users to effectively analyze the
match results on large graphs in practice.

In addition, the number of matched subgraphs decreases
when the size of pattern graphs increases, and it increases
when the size of data graphs increases, as expected. We also
find that although VF2 may find exponentially many matches in
theory, it does not happen very often in practice.

Figure 8: Performance evaluation of centralized algorithms

Table 3: Sizes of matched subgraphs



(4) In the same setting as (2) for testing closeness with 
largest possible datasets, e.g., Amazon with 31245 nodes, 
YouTube with 9368 nodes, and synthetic data with 100000 
nodes, we tested the sizes of the matched subgraphs in data 
graphs returned by Match and Sim. 

Sim returns a single matched subgraph with 103,
177 and 311 nodes in Amazon, YouTube and synthetic data,
respectively. For Match, the results are reported in Table 3.
Its matched subgraphs are typically small, where (a) all
matched subgraphs have fewer than 50 nodes, and (b) over
80% of matches have fewer than 30 nodes, on real-life and
synthetic data. This tells us that strong simulation indeed
restricts the sizes of matches, due to the duality and locality.

Exp-2: Performance of centralized algorithms. In the
second set of experiments, we evaluated the performance
of our algorithms Match, Match+ and algorithms Sim and
VF2. Algorithm VF2 does not scale well with large data
graphs, e.g., it took VF2 more than three hours on data
graphs with 5 × 10^6 nodes (when α = 1.2). Hence, we only
report the performance of VF2 on the small real-life datasets
of Amazon and YouTube that were used to evaluate the
quality of matches. For large synthetic data graphs, we only
report the other three algorithms Match, Match+ and Sim.
In all of our experiments, we also found that TALE and MCS
were much slower than VF2, and hence we did not
report the running time of TALE and MCS here.

(1) To evaluate the impact of pattern graphs Q, we used
two small real-life datasets (Amazon and YouTube) and one
large synthetic dataset. We fixed Amazon, YouTube and the
synthetic data to have 3 × 10^4 nodes, 10^4 nodes and 5 × 10^6
nodes, respectively, while varying the number |V_q| of query
nodes from 2 to 20 or the density α_q of pattern graphs from
1.05 to 1.35 (i.e., increasing pattern edges). The results are
reported in Figures 8(a), 8(b), 8(c) and 8(d).

The elapsed time of algorithms over real-life datasets is
shown in Figures 8(a) and 8(b). When varying |V_q|, VF2
is consistently much slower than the other three algorithms
in both cases. It is about 100 times slower than Match+
when |V_q| > 4 on the two real-life datasets. For instance, it
took VF2 hours on the small Amazon and YouTube datasets.
Note that, however, when |V_q| = 2, VF2 is almost as efficient
as the other algorithms. This is consistent with the complexity
analysis of VF2: VF2 is in low ptime when |V_q| = 2.

As shown in Fig. 8(c), all these algorithms scale well with
|V_q| on large data graphs, except VF2. When we increased
the density α_q of pattern graphs, Figure 8(d) shows that
these algorithms scale well with the density α_q on large data
graphs, except VF2. Algorithms Match and Match+ are
slower than Sim, as expected. Indeed, this is a price that
has to be paid in exchange for better match quality. We did
not report the performance of VF2 in Figures 8(c) and 8(d)
since it could not run to completion when |V_q| > 4.

Finally, observe that the running time of all algorithms
increases when |V_q| or α_q increases. This is consistent with
the complexity analyses of these algorithms.

(2) To evaluate the impact of data graphs G, we also used
two small real-life datasets (Amazon and YouTube) and
one large synthetic dataset. We fixed pattern graphs with
|V_q| = 10, while varying the number |V| of nodes of Amazon,
YouTube and the synthetic data from 6 × 10^3 to 3 × 10^4,
2 × 10^3 to 10^4 and 10^6 to 10^7, respectively, or varying the
density α of data graphs from 1.05 to 1.35. The results are
shown in Figures 8(e), 8(f), 8(g) and 8(h).

These results are consistent with the results of varying 
pattern graph sizes, (a) As shown in Figures 8(e), 8(f), 8(g) 
and 8(h), all these algorithms except VF2 scale well with the 
size of data graphs and with the density a of data graphs; 

(b) algorithms Match and Match" 1 " are slower than Sim; and 

(c) the running time of VF2 increases far more substantially 
with the size and density of data graphs than the others. For 
example, the running time of Match + increased from about 
100s to 600s when the number of nodes of the synthetic data 
varied from 10 6 to 10 7 ; in contrast, VF2 spent nearly 4000s 
on Amazon data with 3 x 10 4 nodes, but only around 30s 
on Amazon graphs with 3 x 10 3 nodes. 

(3) The experimental results in (1) and (2) above also verify
that our optimization techniques are effective. Indeed, the
running time of Match+ is consistently about 2/3 of the time
taken by Match, a significant reduction.
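
The contrast drawn in (2)(c) can be made concrete with a rough calculation. The sketch below is not part of the experimental study: it fits a power law time ~ |V|^k through just the two approximate measurements quoted in the text for each algorithm, and the helper name scaling_exponent is hypothetical. Since the two algorithms were measured on different datasets and only two points are used per fit, the exponents are indicative only.

```python
import math


def scaling_exponent(size_small, time_small, size_large, time_large):
    """Exponent k of a power law time ~ size**k through two measurements."""
    return math.log(time_large / time_small) / math.log(size_large / size_small)


# Approximate figures quoted in the text (assumed to be representative).
# Match+ on synthetic data: about 100s at 10^6 nodes, about 600s at 10^7 nodes.
match_plus_k = scaling_exponent(1e6, 100.0, 1e7, 600.0)
# VF2 on Amazon data: about 30s at 3 x 10^3 nodes, about 4000s at 3 x 10^4 nodes.
vf2_k = scaling_exponent(3e3, 30.0, 3e4, 4000.0)

print(f"Match+ exponent: {match_plus_k:.2f}")  # prints 0.78
print(f"VF2 exponent:    {vf2_k:.2f}")         # prints 2.12
```

Under these assumptions, the running time of Match+ grows sublinearly over the measured range, while that of VF2 grows roughly quadratically, in line with observation (c).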

Summary. From these experimental results we find the
following. (1) Strong simulation is able to identify sensible
matches that are not found by subgraph isomorphism, and to
eliminate those that are found by graph simulation but are
not meaningful. In addition, it finds high-quality matches
that retain graph topology. Indeed, 70%-80% of matches found
by subgraph isomorphism are retrieved by strong simulation,
up to 50% better than graph simulation, without paying the
price of intractable complexity or a large number (or size) of
matches. (2) Our algorithms for strong simulation are
efficient and scale well with the size and density of
large-scale data graphs, e.g., it took 270 seconds when
|V| = 10 s , |V_q| = 10 and |M| = 30. (3) Our optimization
techniques are effective, reducing the running time by at
least 33%.

6. Conclusion 

We have proposed strong simulation to rectify problems
of graph pattern matching based on subgraph isomorphism
and graph simulation. We have verified, both analytically
and experimentally, that strong simulation has several
salient features, notably (1) it is capable of capturing the
topological structures of pattern and data graphs; (2) it
retains the same cubic-time complexity as previous extensions
of graph simulation; (3) it demonstrates data locality and
allows efficient distributed evaluation algorithms; and (4) it
finds bounded matches. Our experimental results have also
verified the effectiveness of our optimization techniques.

Several topics are targeted for future work. First, we are
to extend strong simulation by incorporating regular
expressions on edge types, along the same lines as [18].
Second, our distributed algorithms only aim to demonstrate
the data locality of strong simulation. More sophisticated
algorithms can be developed in the distributed setting, with
better performance guarantees. Third, we are to find metrics
to rank matches found by strong simulation, so as to return
top-ranked matches only. Finally, for large graphs, cubic time
is still too expensive. We are to explore indexing techniques
to speed up the computation, and incremental methods for
strong simulation, minimizing unnecessary recomputation in
response to (frequent) changes to real-life graphs.